Abstract

We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language
modeling. We submit an N-gram as a phrase query to web search engines. The search engines return
the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram
counts are then used to form web-based trigram probability estimates. We discuss the properties of such
estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that
the interpolated models improve speech recognition word error rate significantly over a small test set.
1  Introduction


A  language  model  is  a  critical  component  for  many  applications,  including  speech  recognition.  Enormous 
effort  has  been  spent  on  building  and  improving  language  models.  Broadly  speaking,  this  effort  develops 
along  two  orthogonal  directions:  The  first  direction  is  to  apply  increasingly  sophisticated  estimation  methods 
to  a  fixed  training  data  set  (corpus)  to  achieve  better  estimation.  Examples  include  various  interpolation  and 
backoff  schemes  for  smoothing,  variable  length  N-grams,  vocabulary  clustering,  decision  trees,  probabilistic 
context  free  grammar,  maximum  entropy  models,  etc  [1].  We  can  view  these  methods  as  trying  to  “squeeze 
out”  more  benefit  from  a  fixed  corpus.  The  second  direction  is  to  acquire  more  training  data.  However, 
automatically collecting and incorporating new training data is non-trivial, and there has been relatively
little  research  in  this  direction.  Adaptive  models  are  examples  of  the  second  direction.  For  instance,  a  cache 
language  model  uses  recent  utterances  as  additional  training  data  to  create  better  N-gram  estimates.  The 
recent  rapid  development  of  the  World  Wide  Web  (WWW)  makes  it  an  extremely  large  and  valuable  data 
source.  Just-in-time  language  modeling  [2]  submits  previous  user  utterances  as  queries  to  WWW  search 
engines,  and  uses  the  retrieved  web  pages  as  unigram  adaptation  data.  In  this  paper,  we  propose  a  novel 
method  for  using  the  WWW  and  its  search  engines  to  derive  additional  training  data  for  N-gram  language 
modeling,  and  show  significant  improvements  in  terms  of  speech  recognition  word  error  rate. 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  gives  the  outline  of  our  method,  and  discusses 
the  relevant  properties  of  the  WWW  and  search  engines.  Section  3  investigates  the  problem  of  combining 
a  traditional  corpus  with  data  from  the  web.  Section  4  presents  our  experimental  results.  Finally  Section  5 
discusses  both  the  potential  and  the  limitations  of  our  proposed  method,  and  lists  some  possible  extensions. 


2  The  WWW  as  trigram  training  data 


The basic problem in trigram language modeling is to estimate $p(w_3 \mid w_1, w_2)$, i.e. the probability of a word
given the two words preceding it. This is typically done by smoothing the maximum likelihood estimate

$$p_{ML}(w_3 \mid w_1, w_2) = \frac{c(w_1 w_2 w_3)}{c(w_1 w_2)}$$

with various methods, where $c(w_1 w_2 w_3)$ and $c(w_1 w_2)$ are the counts of "$w_1 w_2 w_3$" and "$w_1 w_2$" in some
training data respectively. The main idea behind our method is to obtain the counts of "$w_1 w_2 w_3$" and
"$w_1 w_2$" as they appear on the WWW, to estimate

$$p_{web}(w_3 \mid w_1, w_2) = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}$$

and combine $p_{web}$ with the estimates from a traditional corpus (here and elsewhere, when $c_{web}(w_1 w_2) = 0$,
we regard $p_{web}(w_3 \mid w_1, w_2)$ as unavailable). Essentially, we are using the searchable web as additional
training data for trigram language modeling.

There  are  several  questions  to  be  addressed.  First,  how  to  obtain  the  counts  from  the  web?  What  is  the 
quality  of  these  web  estimates?  How  could  they  be  used  to  improve  language  modeling?  We  will  examine 
these  questions  in  the  following  sections,  in  the  context  of  N-best  list  rescoring  for  speech  recognition. 


2.1  Obtaining  N-gram  counts  from  the  WWW 

To obtain the count of an N-gram "$w_1 \ldots w_n$" from the web, we use the 'exact phrase search' function
of web search engines. That is, we send "$w_1 \ldots w_n$" as a single quoted phrase query to a search engine.
Ideally, we would like the search engine to report the phrase count, i.e. the total number of occurrences of
the phrase in all its indexed web pages. However, in practice most search engines only report the web page
count, i.e. the number of web pages containing the phrase. Since one web page may contain more than one
occurrence of the phrase, we need to estimate the phrase count from the web page count.

Many  web  search  engines  claim  they  can  perform  exact  phrase  search.  However,  most  of  them  seem  to 
use  an  internal  stop  word  list  to  remove  common  words  from  a  query  phrase.  An  interesting  test  phrase  is 
“to  be  or  not  to  be”:  Some  search  engines  return  totally  irrelevant  web  pages  for  this  query,  since  most,  if 
not  all,  words  are  ignored.  In  addition,  a  few  search  engines  perform  stemming  so  the  query  “she  say”  will 
return  some  web  pages  only  containing  “she  says”  or  “she  said”.  Furthermore,  some  search  engines  report 
neither  phrase  counts  nor  web  page  counts.  We  experimented  with  a  dozen  popular  search  engines,  and 
found  three  that  meet  our  criteria:  AltaVista  [3]  advanced  search  mode,  Lycos  [4],  and  FAST  [5]  1 .  They  all 
report  web  page  counts. 

One  brute  force  method  to  get  the  phrase  counts  is  to  actually  download  all  the  web  pages  the  search 
engine  finds.  However,  queries  of  common  words  typically  result  in  tens  of  thousands  of  web  pages,  and 
this  method  is  clearly  infeasible.  Fortunately  at  the  time  of  our  experiment  AltaVista  had  a  simple  search 
mode,  which  reported  both  the  phrase  count  and  the  web  page  count  for  a  query.  Figure  1  shows  the  phrase 
count  vs.  web  page  count  for  1200  queries.  Trigram  queries  (phrases  consisting  of  three  consecutive  words), 
bigram  queries  and  unigram  queries  are  plotted  separately.  There  are  horizontal  branches  in  the  bigram  and 
trigram  plots  that  don’t  make  sense  (more  web  pages  than  total  phrase  counts).  We  regard  these  as  outliers 
due  to  idiosyncrasies  of  the  search  engine,  and  exclude  them  from  further  consideration.  The  three  plots  are 
largely  log-linear.  This  prompted  us  to  perform  the  following  log-linear  regression  separately  for  trigrams, 
bigrams,  and  unigrams: 


$$c = \alpha_0 \cdot pg^{\alpha_1}$$

where $c$ is the phrase count, and $pg$ the web page count. Table 1 lists the coefficients. The three
regression functions are also plotted in Figure 1. We assume these functions apply to other search engines as
well. In the rest of the paper, all web N-gram counts are estimated by applying the corresponding regression
function to the web page counts reported by search engines.
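
As an illustration only, the regression step and the resulting web trigram estimate could be sketched as below. The coefficients are copied from Table 1; the function names `phrase_count` and `web_trigram_estimate` are hypothetical and not part of the original experiments, and no search engine API is assumed (page counts are passed in as plain numbers).

```python
# Regression coefficients (alpha_0, alpha_1) copied from Table 1,
# keyed by N-gram order: 1 = unigram, 2 = bigram, 3 = trigram.
COEFFS = {1: (2.427, 1.019), 2: (1.209, 1.014), 3: (1.174, 1.025)}


def phrase_count(page_count, n):
    """Estimate the phrase count c = alpha_0 * pg**alpha_1 for an n-gram."""
    if page_count <= 0:
        return 0.0
    alpha_0, alpha_1 = COEFFS[n]
    return alpha_0 * page_count ** alpha_1


def web_trigram_estimate(trigram_pages, bigram_pages):
    """Return c_web(w1 w2 w3) / c_web(w1 w2), or None if the bigram is unseen."""
    c_bigram = phrase_count(bigram_pages, 2)
    if c_bigram == 0.0:
        return None  # the web estimate is regarded as unavailable
    return phrase_count(trigram_pages, 3) / c_bigram
```

Because the two counts come from separate regressions, nothing in this sketch forces the ratio to stay below 1.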


Figure 1: Web phrase count vs. web page count (axis label: web page number returned by search engines)


1  Our  selection  is  admittedly  incomplete.  In  addition,  since  search  engines  develop  and  change  rapidly,  all  our  comments  are 
only  valid  during  the  period  of  this  experiment. 
            $\alpha_0$    $\alpha_1$
Unigram     2.427         1.019
Bigram      1.209         1.014
Trigram     1.174         1.025

Table 1: Coefficients of log-linear regression for estimating Web N-gram counts from Web page counts
reported by search engines.

2.2  The  quality  of  web  estimates

To  investigate  the  quality  of  web  estimates,  we  needed  a  baseline  corpus  for  comparison.  The  baseline  we 
used  is  a  103  million  word  Broadcast  News  corpus. 

2.2.1  Web  N-gram  coverage 

The first experiment we ran was an N-gram coverage test on unseen text. That is, we wanted to see how many
N-grams  in  the  test  text  are  not  on  the  web,  and/or  not  in  the  baseline  corpus.  We  were  hoping  to  show 
that  the  web  covers  many  more  N-grams  than  the  baseline  corpus.  Note  that  by  ‘the  web’  we  mean  the 
searchable  portion  of  the  web  as  indexed  by  the  search  engines  we  chose. 

The  unseen  news  test  text  consisted  of  24  randomly  chosen  sentences  from  4  web  news  sources  (CNN, 
ABC,  Fox,  BBC)  and  6  categories  (world,  domestic,  technology,  health,  entertainment,  politics).  All  the 
sentences  were  selected  from  the  day’s  news  stories,  on  the  day  the  experiment  was  carried  out.  This  was 
to  make  sure  that  the  search  engines  hadn’t  had  the  time  to  index  the  web  pages  containing  these  sentences. 
After  the  experiment  was  completed,  we  checked  each  sentence,  and  indeed  none  of  them  were  found  by  the 
search  engines  yet.  Therefore  the  test  text  is  truly  unseen  to  both  the  web  search  engines  and  the  baseline 
corpus.  (The  test  text  is  of  written  news  style,  which  might  be  slightly  different  from  the  broadcast  news 
style  in  the  baseline  corpus.) 

There  are  327  unigram  types  (i.e.  unique  words),  462  bigram  types  and  453  trigram  types  in  the  test 
text.  Table  2  lists  the  number  of  N-gram  types  not  covered  by  the  different  search  engines  and  the  baseline 
corpus,  respectively. 


            Unique     Not Covered By
            Types      AltaVista   Lycos   FAST   Corpus
Unigram     327        0           0       0      8
Bigram      462        4           5       5      68
Trigram     453        46          46      46     189

Table 2: Novel N-gram types in 24 news sentences

Clearly,  the  web’s  coverage,  under  any  of  the  search  engines,  is  much  better  than  that  of  the  baseline 
corpus.  It  is  also  worth  noting  that  for  this  test  text,  any  N-gram  not  covered  by  the  web  was  also  not  covered 
by  the  baseline  corpus. 

In the next experiment, we focused on the trigrams in the test text to answer the question "if one randomly
picks a trigram from the test text, what's the chance the trigram has appeared c times in the training
data?" Figure 2 shows the comparison, with the training data being the baseline corpus and the web through
the different search engines, respectively. This figure is also known as a "frequency-of-frequency" plot.
According to this figure, a trigram from the test text has more than a 40% chance of being absent from the baseline
corpus, and the chance goes down to about 10% on the web, regardless of the search engine. This is consistent
with Table 2. Moreover, the trigram has a much larger chance of having a small count in the baseline
corpus than on the web. Since small counts usually mean unreliable estimates, resorting to the web could be
beneficial.

Figure 2: Empirical frequency-frequency plot

2.2.2  The  effective  size  of  the  web 

Recently,  Fienberg  et  al.  [6]  estimated  the  size  of  the  indexable  web  as  of  1997  to  be  close  to  1  billion  pages. 
The  web  grows  exponentially,  and  as  of  this  writing  some  search  engines  claim  they  have  indexed  more  than 
1  billion  pages.  We  would  like  to  estimate  the  effective  size  of  the  web  as  a  language  model  training  corpus. 

Let’s  assume  that  the  web  and  the  baseline  corpus  are  homogeneous  (which  is  patently  false,  since 
the  web  has  much  more  than  news,  but  we  will  ignore  this  for  the  time  being).  Then  the  probability  of  a 
particular  N-gram  appearing  in  the  baseline  corpus  is  the  same  as  the  probability  that  it  appears  on  the  web: 

$$p_{corpus}(\text{N-gram}) = p_{web}(\text{N-gram})$$

Since the probabilities can be approximated by their respective frequencies, we have

$$\frac{c_{corpus}(\text{N-gram})}{|corpus|} \approx \frac{c_{web}(\text{N-gram})}{|web|}$$

from which we can estimate $|web|$, the size of the web in words. Note that it doesn't matter if the N-gram is
a unigram, bigram or trigram, though N-grams with small counts are unreliable and should be excluded. In
our experiment, we considered all unigrams, bigrams and trigrams in the test text with $c_{corpus} > 10$. Each
such N-gram gave us an estimate, and we took the median of all these estimates for robustness. Table 3
gives our estimates of the size with different search engines.
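
The median-based estimate can be sketched as follows. This is a minimal illustration under the homogeneity assumption stated above; the function name `effective_web_size` and the input layout (a list of count pairs) are hypothetical.

```python
from statistics import median


def effective_web_size(corpus_size, counts, min_corpus_count=10):
    """Estimate |web| in words from per-N-gram counts.

    counts: iterable of (c_corpus, c_web) pairs, one per N-gram in the test text.
    Each N-gram with c_corpus > min_corpus_count yields one estimate
    |web| ~= |corpus| * c_web / c_corpus; the median of these is returned.
    """
    estimates = [
        corpus_size * c_web / c_corpus
        for c_corpus, c_web in counts
        if c_corpus > min_corpus_count
    ]
    return median(estimates) if estimates else None
```

Taking the median rather than the mean keeps a few N-grams with unusual web counts from dominating the result.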

             Effective size of the web
AltaVista    108 billion words
Lycos        79 billion words
FAST         83 billion words

Table 3: The effective size of the web for language model training

Some points to notice:

1. The 'effective web size' estimates we obtained are very rough at best. Moreover, they are defined
relative to the specific baseline corpus and specific test set we happened to choose. Therefore, Table 3
should not be used to rank the performance of individual search engines.

2. This method tends to underestimate the web size. We assumed homogeneity, which in actuality does
not hold. The test text comes from a news domain, and so does the baseline corpus. We used N-grams
from the test text to estimate the web size, which gives rise to a selectional bias. Intuitively,
only "news terms" are in the test text. And since the corpus is in the news domain, as a whole we have
$p_{corpus}(\text{news terms}) > p_{web}(\text{news terms})$. This is what leads to underestimation.

2.2.3  Normalization  of  the  web  counts 

An interesting sanity check is to see whether

$$c_{web}(w_1 w_2) = \sum_{w_3 \in V} c_{web}(w_1 w_2 w_3)$$

holds for any bigram "$w_1 w_2$". If this is true, the relative frequency estimate $p_{web}(w_3 \mid w_1, w_2)$ would
already be normalized, i.e.

$$\sum_{w_3 \in V} p_{web}(w_3 \mid w_1, w_2) = 1, \quad \forall w_1, w_2$$

Of course there are too many "$w_1 w_2 w_3$" combinations to verify this directly. Instead, we randomly
chose six "$w_1 w_2$" pairs from the baseline corpus. For each pair, we chose 2000 $w_3$'s according to the following
heuristic: First, we selected words from a list of all $w_3$'s such that the trigram "$w_1 w_2 w_3$" appeared
in the baseline corpus, sorted by decreasing frequency. If fewer than 2000 words were chosen that way, we
added words from a list of all $w_3$'s such that the bigram "$w_2 w_3$" appeared in the baseline corpus, in decreasing
frequency order. If this was still not enough, we added $w_3$'s according to their unigram frequencies. We
expected this heuristic to give us a list of $w_3$'s that covers the majority of the conditional probability mass
given history "$w_1 w_2$".

Table  4  shows  web  bigram  count  estimates  obtained  with  FAST  search,  together  with  their  respective 
cumulative  web  trigram  count  estimates  as  described  above.  Ideally,  the  ratio  should  be  close  to,  but  less 
than,  100%.  It  is  evident  from  the  table  that  the  web  counts  are  not  perfectly  normalized.  The  reasons  are 
not  entirely  clear  to  us,  but  the  fact  that  the  N-gram  counts  are  estimated  from  page  counts  is  an  obvious 
candidate.  The  web  N-gram  count  estimates  should  therefore  be  used  with  caution. 
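
The ratios reported in Table 4 can be reproduced directly from the two count columns. The snippet below is only a convenience sketch; the numbers are copied from Table 4 and the function name `normalization_ratio` is hypothetical.

```python
# (history, c_web(w1 w2), cumulative c_web(w1 w2 w3) over the 2000 chosen w3),
# copied from Table 4.
TABLE_4 = [
    ("about seventy", 16498.3, 14807.7),
    ("and there's", 662697.0, 724870.0),
    ("group being", 20248.4, 16246.5),
    ("lewinsky after", 1431.9, 1631.7),
    ("two hundred", 389949.0, 457656.0),
    ("willy b.", 1334.6, 607.2),
]


def normalization_ratio(bigram_count, cumulative_trigram_count):
    """Fraction of the bigram count accounted for by the summed trigram counts."""
    return cumulative_trigram_count / bigram_count


for history, c_bigram, c_trigram_sum in TABLE_4:
    ratio = normalization_ratio(c_bigram, c_trigram_sum)
    print(f"{history:15s} {100 * ratio:6.1f}%")
```

A ratio above 100% means the estimated trigram counts for a history add up to more than the estimated count of the history itself, which cannot happen with true counts.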


2.2.4  The  variance  and  bias  of  web  trigram  estimates 


As stated earlier, we are interested in estimating conditional trigram probabilities based on their relative
frequency on the web:

$$p_{web}(w_3 \mid w_1, w_2) = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}$$

It would be informative to compare $p_{web}(w_3 \mid w_1, w_2)$ to a traditional (corpus derived) trigram probability
estimate.
"$w_1 w_2$"         $c_{web}(w_1 w_2)$    $\sum_{2000} c_{web}(w_1 w_2 w_3)$    ratio
about seventy     16498.3               14807.7                               90%
and there's       662697.0              724870.0                              109%
group being       20248.4               16246.5                               80%
lewinsky after    1431.9                1631.7                                114%
two hundred       389949.0              457656.0                              117%
willy b.          1334.6                607.2                                 45%

Table 4: Sanity check: are web counts normalized?

Figure 3: Ratio of web trigram estimates to corpus trigram estimates


To this end, we created a baseline trigram language model LM0 from the 103 million word baseline corpus.
We used modified Kneser-Ney smoothing [7] [8] which, according to [8], is one of the best smoothing
methods available. In building LM0, we discarded all singleton trigrams in the baseline corpus, a common
practice to reduce language model size. We denote LM0's probability estimates by $p_0$.

With LM0, we were able to compare $p_{web}(w_3 \mid w_1, w_2)$ with $p_0(w_3 \mid w_1, w_2)$. We computed the ratio $r$:

$$r(w_1, w_2, w_3) = \frac{p_{web}(w_3 \mid w_1, w_2)}{p_0(w_3 \mid w_1, w_2)}$$

between these two estimates. We expected $r$ to be more spread out (having larger variance) when $c_{corpus}(w_1 w_2 w_3)$
is small, since in this case $p_0(w_3 \mid w_1, w_2)$ tends to be unreliable.

We computed $r(w_1, w_2, w_3)$ for every trigram in the test text, excluding those with $c_{web}(w_1 w_2) = 0$.

We plot $r(w_1, w_2, w_3)$ vs. $c_{corpus}(w_1 w_2 w_3)$ in Figure 3. We found that:

1. For trigrams with large $c_{corpus}(w_1 w_2 w_3)$, $r$ averages to about 1. Thus the web estimates are consistent
with LM0 in this case.

2. As we expected, the variance of $r$ is largest when $c_{corpus}(w_1 w_2 w_3) = 0$, and decreases as the count gets
large. Hence the 'funnel' shape.

3. When $c_{corpus}(w_1 w_2 w_3)$ is small, especially 0 and 1, $r$ is biased upward. This is of course good news,
as it suggests that this is where the web estimates tend to improve on the corpus estimates.

4. All the search engines give similar results.


3  Combining  web  estimates  with  existing  language  model 

In the previous section, we saw the potential of the web: it is huge, it has better trigram coverage, and its
trigram estimates are largely consistent with the corpus-based estimates. Nevertheless, to query each and
every N-gram on the web is infeasible. This prevents us from building a full-fledged language model from
the web via search engines. Moreover, Table 4 indicates that web estimates are not well normalized. In
addition, the content of the web is heterogeneous and usually doesn't coincide with our domain of interest.
Based on these considerations, we decided not to try to build an entire language model from the web. Rather,
we start from a traditional language model LM0, and interpolate its least reliable trigram estimates with
the appropriate estimates from the web.

Unreliable trigram estimates, especially those involving backing off to lower order N-grams, have been
shown to be correlated with increased speech recognition errors [9] [10]. By going to the much larger web
for reliable estimates, our hope was to alleviate this problem. We used the trigram counts in the baseline
corpus as a heuristic to decide the reliability of trigram estimates in LM0. A trigram estimate $p_0(w_3 \mid w_1, w_2)$
is deemed unreliable if

$$c_{corpus}(w_1 w_2 w_3) \le \tau$$

where $\tau$, the 'reliability threshold', is a predetermined small non-negative integer, e.g. 1. Admittedly this
definition of unreliable estimates is biased.

Even with this definition, there are still too many unreliable trigram estimates to query the web for. Since
we were interested in N-best list rescoring, we further restricted the queries to those unreliable trigrams that
appeared in the particular N-best list being processed. This greatly reduces the number of web queries at the
price of some further bias. Let $U_{w_1 w_2}$ be the set of words that have unreliable trigram estimates with history
"$w_1 w_2$" in the current N-best list, i.e.

$$U_{w_1 w_2} = \{ w_3 \mid \text{"}w_1 w_2 w_3\text{"} \in \text{N-best} \wedge c_{corpus}(w_1 w_2 w_3) \le \tau \}, \quad \forall\, \text{"}w_1 w_2\text{"} \in \text{N-best} \wedge c_{web}(w_1 w_2) > 0 \tag{1}$$

We obtain $c_{web}(w_1 w_2 u)$, $u \in U_{w_1 w_2}$, and $c_{web}(w_1 w_2)$ via search engines, and compute $p_{web}(u \mid w_1, w_2)$,
the web relative frequency estimates, from these web counts.
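
A minimal sketch of how the sets in (1) might be collected from one N-best list is given below. It assumes hypotheses are lists of word strings and that the baseline corpus trigram counts are available as a dictionary; the function name `unreliable_sets` is hypothetical. The condition $c_{web}(w_1 w_2) > 0$ can only be checked after the web has been queried, so it is not applied here.

```python
from collections import defaultdict


def unreliable_sets(nbest, corpus_trigram_count, tau):
    """Map each history (w1, w2) to the set of w3 with c_corpus(w1 w2 w3) <= tau.

    nbest: list of hypotheses, each a list of word strings.
    corpus_trigram_count: dict mapping (w1, w2, w3) to its baseline corpus count;
        trigrams missing from the dict are treated as having count 0.
    """
    sets = defaultdict(set)
    for hyp in nbest:
        # Slide a window of three consecutive words over the hypothesis.
        for w1, w2, w3 in zip(hyp, hyp[1:], hyp[2:]):
            if corpus_trigram_count.get((w1, w2, w3), 0) <= tau:
                sets[(w1, w2)].add(w3)
    return dict(sets)
```

Each key of the returned dictionary corresponds to one bigram query, and each word in its set to one trigram query.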

Let $p^*(u \mid w_1, w_2)$ denote the final interpolated estimates, which combine $p_0(u \mid w_1, w_2)$ and $p_{web}(u \mid w_1, w_2)$.
We would like to have a tunable parameter so that on one extreme $p^*(u \mid w_1, w_2) \to p_0(u \mid w_1, w_2)$, while on
the other extreme $p^*(u \mid w_1, w_2) \to p_{web}(u \mid w_1, w_2)$. We now present three different methods for doing this.


3.1  Exponential  Models  with  Gaussian  Priors 

We define a set of binary functions, or 'features', as follows:

$$f_{w_1,w_2,u}(w_3) = \begin{cases} 1 & \text{if } u = w_3 \\ 0 & \text{otherwise} \end{cases}$$

for all $w_1, w_2, u \in U_{w_1 w_2}$ in the N-best list. Next, for any given $w_1, w_2$, we define a conditional exponential
model $p^*_E$ with these features:

$$p^*_E(w_3 \mid w_1, w_2) = \frac{1}{Z_{w_1,w_2}}\, p_0(w_3 \mid w_1, w_2)\, \exp\Big(\sum_{u \in U_{w_1 w_2}} \lambda_{w_1,w_2,u}\, f_{w_1,w_2,u}(w_3)\Big) \tag{2}$$

where $p_0$ is the estimate provided by LM0, the $\lambda$'s are parameters to be optimized, and $Z_{w_1,w_2}$ is a normalization
factor. This model has exactly the same form as a conventional Maximum Entropy / Minimum Discriminative
Information (ME/MDI) model [11] [12]. Let $\Lambda$ denote the set of parameters. If we maximize the
likelihood of the web counts:

$$L(\Lambda) = \prod_{w_1,w_2,w_3} p^*_E(w_3 \mid w_1, w_2)^{c_{web}(w_1 w_2 w_3)}$$

with the standard Generalized Iterative Scaling algorithm (GIS) [13], we get the ME/MDI solution that
satisfies the following constraints:

$$p^*_E(u \mid w_1, w_2) = p_{web}(u \mid w_1, w_2) = \frac{c_{web}(w_1 w_2 u)}{c_{web}(w_1 w_2)}, \quad \forall w_1, w_2, u \in U_{w_1 w_2} \tag{3}$$


This corresponds to one extreme of the interpolation. But since we want to control the degree of interpolation,
we introduce a Gaussian prior with mean 0 and variance $\sigma^2$ over $\Lambda$:

$$p(\Lambda) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{\lambda_i^2}{2\sigma^2}\Big)$$

And instead of seeking the maximum likelihood solution, we seek the maximum a posteriori (MAP) solution
that maximizes

$$L(\Lambda) \cdot p(\Lambda)$$

This can be done by slightly modifying the GIS algorithm, as described in [14].

With this Gaussian prior, we can control the degree of interpolation by choosing the value of $\sigma^2 \in (0, +\infty)$.
$\sigma^2$ acts as a tuning parameter: If $\sigma^2 \to +\infty$, the Gaussian prior is flat and has virtually no
restriction on the values of the $\lambda$'s. Thus the $\lambda$'s can reach their ME/MDI solutions, and hence $p^*_E$ reaches
one extreme as in (3). On the other hand if $\sigma^2 \to 0$, the Gaussian prior forces the $\lambda$'s to be close to the
mean, which is 0. From (2) we know in this case $p^*_E \to p_0$. This corresponds to the other extreme of the
interpolation. A $\sigma^2$ between 0 and $+\infty$ results in an intermediate $p^*_E$ distribution.
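
The two extremes can be illustrated numerically for a single history "$w_1 w_2$". The sketch below is not the modified GIS algorithm of [14]: it substitutes plain gradient ascent on the log posterior, which reaches the same maximum for this concave objective. It also assumes that $c_{web}(w_1 w_2)$ is the total count for the history, so that the gradient with respect to $\lambda_u$ is $c_{web}(w_1 w_2 u) - c_{web}(w_1 w_2)\, p^*_E(u) - \lambda_u / \sigma^2$. All function and argument names are hypothetical.

```python
import math


def model_probs(p0, lam):
    """Exponential model of (2) for one history: p0(w) * exp(lambda_w) / Z."""
    unnorm = {w: p * math.exp(lam.get(w, 0.0)) for w, p in p0.items()}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}


def map_exponential_model(p0, web_counts, bigram_count, sigma2, iters=2000):
    """MAP estimate under a zero-mean Gaussian prior with variance sigma2.

    p0: dict word -> baseline probability for this history (sums to 1).
    web_counts: dict u -> c_web(w1 w2 u) for the unreliable words u.
    bigram_count: c_web(w1 w2), assumed to be the total count for the history.
    """
    lam = {u: 0.0 for u in web_counts}
    # Step size 1 / (upper bound on the curvature of the log posterior).
    step = 1.0 / (bigram_count + 1.0 / sigma2)
    for _ in range(iters):
        p = model_probs(p0, lam)
        for u in lam:
            grad = web_counts[u] - bigram_count * p[u] - lam[u] / sigma2
            lam[u] += step * grad
    return model_probs(p0, lam)
```

With a very large `sigma2` the returned probabilities of the unreliable words approach the web relative frequencies, and with a very small `sigma2` they stay at the baseline values.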

For  the  purpose  of  comparison,  we  experimented  with  two  other  interpolation  methods,  which  are  easy 
to  implement  but  may  be  theoretically  less  well  motivated:  linear  interpolation  and  geometric  interpolation. 


3.2  Linear  Interpolation 

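In linear interpolation, we have

$$p^*_L(w_3 \mid w_1, w_2) = \begin{cases} (1 - \alpha)\, p_0(w_3 \mid w_1, w_2) + \alpha\, p_{web}(w_3 \mid w_1, w_2), & \text{if } w_3 \in U_{w_1 w_2} \\[2ex] \dfrac{1 - \sum_{u \in U_{w_1 w_2}} p^*_L(u \mid w_1, w_2)}{1 - \sum_{u \in U_{w_1 w_2}} p_0(u \mid w_1, w_2)}\, p_0(w_3 \mid w_1, w_2), & \text{otherwise} \end{cases} \tag{4}$$

In this case, $\alpha \in [0, 1]$ is the tuning parameter. If $\alpha = 0$, $p^*_L = p_0$. If $\alpha = 1$, $p^*_L$ satisfies (3). An $\alpha$ in
between results in an intermediate $p^*_L$.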
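
For a single history, the two cases of (4) can be sketched as below. The dictionaries and the function name `linear_interpolate` are hypothetical, and the sketch assumes that the unreliable words do not make up the whole vocabulary (otherwise the rescaling denominator is zero).

```python
def linear_interpolate(p0, p_web, alpha):
    """Linear interpolation of (4) for one history.

    p0: dict word -> baseline probability for this history (sums to 1).
    p_web: dict u -> web estimate, for the unreliable words u only.
    """
    # First case: mix the two estimates for the unreliable words.
    p_star = {u: (1 - alpha) * p0[u] + alpha * p_web[u] for u in p_web}
    # Second case: rescale the baseline for all other words so the total is 1.
    scale = (1 - sum(p_star.values())) / (1 - sum(p0[u] for u in p_web))
    for w in p0:
        if w not in p_web:
            p_star[w] = scale * p0[w]
    return p_star
```

The rescaling leaves the relative proportions of the reliable words unchanged and only adjusts how much total mass they share.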
3.3  Geometric  Interpolation

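In geometric interpolation, we have

$$p^*_G(w_3 \mid w_1, w_2) = \begin{cases} \left(\dfrac{c_{web}(w_1 w_2 w_3) + \epsilon}{c_{web}(w_1 w_2) + |V|\epsilon}\right)^{\beta} p_0(w_3 \mid w_1, w_2)^{(1-\beta)}, & \text{if } w_3 \in U_{w_1 w_2} \\[2ex] \dfrac{1 - \sum_{u \in U_{w_1 w_2}} p^*_G(u \mid w_1, w_2)}{1 - \sum_{u \in U_{w_1 w_2}} p_0(u \mid w_1, w_2)}\, p_0(w_3 \mid w_1, w_2), & \text{otherwise} \end{cases} \tag{5}$$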


Note that here we have to smooth the web estimates to avoid zeros (which is not a problem in the previous
two methods). To do this, we simply add a small positive value $\epsilon$ to the web counts. This is known as
additive smoothing [8]. The value of $\epsilon$ is determined to minimize the perplexity with $\beta = 1$. Once $\epsilon$ is
chosen it is fixed, and we tune $\beta$. $\beta \in [0, 1]$ is the interpolation parameter. If $\beta = 0$, $p^*_G = p_0$. If $\beta = 1$, $p^*_G$
satisfies the smoothed web estimates. A $\beta$ in between results in an intermediate $p^*_G$.
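
A sketch of geometric interpolation with additive smoothing for a single history follows. It assumes the unreliable words receive a weighted geometric mean of the smoothed web estimate and $p_0$, and that the remaining words are rescaled as in linear interpolation. The function and argument names are hypothetical, and the default `eps` is only a placeholder value.

```python
def geometric_interpolate(p0, web_counts, bigram_count, vocab_size, beta, eps=0.01):
    """Geometric interpolation with additively smoothed web counts.

    p0: dict word -> baseline probability for this history (sums to 1).
    web_counts: dict u -> c_web(w1 w2 u), for the unreliable words u only.
    bigram_count: c_web(w1 w2); vocab_size: |V|.
    """
    p_star = {}
    for u, count in web_counts.items():
        # Additive smoothing keeps the web estimate strictly positive.
        smoothed = (count + eps) / (bigram_count + vocab_size * eps)
        p_star[u] = smoothed ** beta * p0[u] ** (1 - beta)
    # Rescale the baseline for the remaining words so the total is 1.
    scale = (1 - sum(p_star.values())) / (1 - sum(p0[u] for u in web_counts))
    for w in p0:
        if w not in web_counts:
            p_star[w] = scale * p0[w]
    return p_star
```

Without the smoothing term, a web count of zero would force the interpolated probability to zero for every `beta` greater than 0.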


4  Experimental  Result 

We  randomly  selected  200  utterance  segments  from  the  TREC-7  Spoken  Document  Retrieval  track  data  [15] 
as  our  test  set  for  this  experiment.  For  each  utterance  we  have  its  correct  transcript  and  an  N-best  list  with 
N  =  1000,  i.e.  1000  decoding  hypotheses.  We  performed  N-best  list  rescoring  to  measure  the  word  error 
rate  (WER)  improvement,  and  computed  the  perplexity  of  the  transcript.  Note  that  the  test  set  is  relatively 
small  and  N  =  1000  is  not  very  deep,  since  we  wanted  to  limit  the  number  of  web  queries  to  within  a 
practical  range. 

4.1  Word  Error  Rate 

If  we  rescore  the  N-best  lists  with  LM0  and  pick  the  top  hypotheses,  the  WER  is  33.45%.  This  is  our 
baseline  WER.  The  oracle  WER,  i.e.  if  we  were  able  to  pick  the  least  errorful  hypothesis  among  the  1000 
for  each  N-best  list,  is  25.26%.  Of  course  we  cannot  achieve  the  oracle  WER,  but  it  indicates  there  is  room 
for improvement over LM0.
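
For reference, WER and oracle WER over N-best lists can be computed as sketched below. This is a generic edit-distance implementation, not the scoring tool used in the experiments; all function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, insertions and deletions (word level)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(references, chosen):
    """Total word errors divided by the total number of reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, chosen))
    return errors / sum(len(r) for r in references)


def oracle_wer(references, nbest_lists):
    """WER obtained by picking the least errorful hypothesis in each N-best list."""
    best = [
        min(nbest, key=lambda hyp: word_errors(ref, hyp))
        for ref, nbest in zip(references, nbest_lists)
    ]
    return wer(references, best)
```

The oracle value is a lower bound for any rescoring method, since rescoring can only reorder the hypotheses already in the list.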

Since each utterance has 1000 hypotheses in the N-best list, the total number of trigrams is very large.
Table 5 lists the number of trigram tokens (occurrences) and types (unique ones) in all the N-best lists combined,
together with the percentage of unreliable trigram types and tokens as determined by the reliability
threshold $\tau$. Note that trigrams containing start-of-sentence or end-of-sentence (commonly designated by
<s> and </s>) are excluded from the table, since they can't be queried from the web. For each N-best
list, we queried the unreliable trigrams (and associated bigrams) in the list, from which we computed $p^*$
with the three different interpolation methods. We then used $p^*$ to rescore the N-best list, and calculated the
WER of the top hypothesis after rescoring.

First, we set the reliability threshold $\tau = 0$, i.e. we regard only those trigrams that never occur in the
baseline corpus as unreliable. Figure 4(a) shows the WER with exponential models and Gaussian priors.
The three curves stand for different search engines, which turn out to be very similar. The horizontal dashed
line is the baseline WER. As predicted, when the variance of the Gaussian prior $\sigma^2 \to 0$ (the left side
of the figure), $p^*_E$ converges to $p_0$ and the WER converges to the baseline WER. On the other hand, when
$\sigma^2 \to +\infty$, the estimates of the unreliable trigrams come solely from the web. Such estimates seem inferior,
and the model has higher WER than the baseline. Between these two extremes, WER reaches its minimum
(32.53% with AltaVista) around $\sigma^2 = 1$.

trigram   total        reliability threshold $\tau$
                       0           1           2           3           4           5
tokens    5,311,303    2,002,530   2,310,416   2,496,312   2,650,340   2,772,500   2,889,348
                       37.7%       43.5%       47.0%       49.9%       52.2%       54.4%
types     57,107       36,190      39,059      40,893      42,158      43,110      43,863
                       63.4%       68.4%       71.6%       73.8%       75.5%       76.8%

Table 5: Number of unreliable trigrams in the N-best lists

Figure 4: Word Error Rates of web-improved language models as function of the smoothing parameter for
several different interpolation schemes, based on N-best rescoring. Panels: (a) Exponential Models,
(b) Linear Interpolation, (c) Geometric Interpolation, (d) Reliability Threshold.

Figure 4(b) is the WER with linear interpolation. Again, the minimum WER 32.56% is reached between
the two extremes at $\alpha = 0.4$ by AltaVista.

To use geometric interpolation, we needed to choose a value for $\epsilon$ first. We chose $\epsilon = 0.01$ because this
minimized the perplexity when $\beta = 1$. Next we varied $\beta$ while keeping $\epsilon$ fixed, and plotted the WER of the
interpolated model in Figure 4(c). As with the previous interpolation methods, the WER reaches its minimum
when the interpolation factor is near the middle. The minimum is 32.69% when $\beta = 0.3$ with FAST.

Next, we adjusted the reliability threshold $\tau$ and observed its effect on WER. The interpolation method
used here is the exponential model with Gaussian prior and $\sigma^2 = 1$. We varied $\tau$ from 0 to 5. With a larger
threshold, more trigrams are regarded as unreliable, and hence more web queries had to be issued. As
shown in Figure 4(d), there is a slight but definite improvement in WER when we increase $\tau$ from 0 to 1.
For example, the WER with $\tau = 1$ and AltaVista is 32.45%. Further increments result in about the same
WER, averaged over search engines. Note that LM0, the language model we are incorporating web estimates
into, was built after excluding all singleton trigrams in the corpus. This may explain why $\tau = 1$ is better,
since trigrams with counts 0 or 1 in the corpus are indeed unreliable: in LM0 they must back off to bigram
or unigram.

To analyze the source of improvement, we broke down the WER according to the trigram backoff modes
in LM0. First, we marked each word $w_i$ in the transcript with one of several labels, using the following rules:
Let $w_{i-2}$ and $w_{i-1}$ be the two words preceding $w_i$. If the trigram "$w_{i-2} w_{i-1} w_i$" exists in LM0, label $w_i$
as '3'. Otherwise if the trigram doesn't exist in LM0, but the bigram "$w_{i-1} w_i$" does, label $w_i$ as '3-2',
meaning LM0 has to back off to the bigram for $w_i$. If the bigram doesn't exist in LM0 either, label $w_i$ as
'3-2-1' since LM0 has to back off to the unigram. In the second step, we compared the transcript with the
top hypotheses after rescoring the N-best lists with $p_0$. Each word in the transcript obtains a second label of
either "correct" or "wrong" depending on whether the word is correct in the corresponding top hypothesis.
We then collected the percentage of correct words within categories '3', '3-2' and '3-2-1' respectively. In the
third step we repeated the second step, except that the top hypotheses are now obtained by rescoring the
N-best lists with $p^*_E$, where $\sigma^2 = 1$, $\tau = 1$, and the search engine is AltaVista. We compare the percentage
of errors in step 2 and step 3 in Table 6. Note that insertion errors are not counted in our error breakdown.
Not surprisingly, the '3-2-1' category has the highest error rate for both $p_0$ and $p^*_E$, since the words in this
category are the hardest from the language model's point of view. The '3-2' category has a lower error rate,
and '3' has the lowest. The interpolated language model $p^*_E$ improves the error rate for all three categories,
compared to $p_0$. The largest improvement is in the '3-2-1' category, which suggests the web helps LM0
most with the hardest cases. It is not clear though why the '3-2' category is not improved as much.


category    words    error rate
                     p_0      p*_E
3           3480     23.3%    22.8%
3-2         2236     30.7%    30.1%
3-2-1        479     50.1%    46.1%

Table 6: Error breakdown by LM_0 backoff mode
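
The labeling rules and the per-category error rate can be sketched as follows. This is an illustration under stated assumptions, not the scripts used for the experiments: the function names, the representation of the baseline model as plain sets of trigrams and bigrams, and the sentence-start padding symbol are all hypothetical. Only transcript words are scored, so insertion errors are ignored, as in Table 6.

```python
from collections import defaultdict

def backoff_label(w2, w1, w, trigrams, bigrams):
    """Label word w by how far the baseline model must back off."""
    if (w2, w1, w) in trigrams:
        return "3"        # trigram exists in the baseline model
    if (w1, w) in bigrams:
        return "3-2"      # back off to the bigram
    return "3-2-1"        # back off to the unigram

def error_rate_by_category(words, correct, trigrams, bigrams, pad="<s>"):
    """Per-category error rate over transcript words.

    words:   transcript tokens
    correct: parallel booleans, True if the word is right in the top hypothesis
    pad:     assumed sentence-start symbol used as history for the first words
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    padded = [pad, pad] + list(words)
    for i, (w, ok) in enumerate(zip(words, correct)):
        # padded[i] and padded[i + 1] are the two words preceding w.
        label = backoff_label(padded[i], padded[i + 1], w, trigrams, bigrams)
        totals[label] += 1
        if not ok:
            errors[label] += 1
    return {label: errors[label] / totals[label] for label in totals}

# Toy model and transcript, made up for illustration.
trigrams = {("<s>", "<s>", "the"), ("the", "cat", "sat")}
bigrams = {("the", "cat"), ("cat", "sat")}
words = ["the", "cat", "sat", "down"]
correct = [True, True, False, False]

rates = error_rate_by_category(words, correct, trigrams, bigrams)
```

Running the same function once with the correctness labels from p_0 and once with those from the interpolated model gives the two error-rate columns of a table like Table 6.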


4.2  Approximate  Perplexity 

There are 6195 words in the transcript. The baseline perplexity of the transcript with LM_0 is 196.7. We
wanted to compute the perplexity of the transcript with different interpolated language models. We define
U_{w1 w2} in (1) based on the transcript. However, this introduces a subtle bias: the interpolated models now
depend  on  the  transcript.  In  other  words,  we  are  dynamically  choosing  models  according  to  the  words  we 
will  be  predicting.  The  resulting  scores  are  therefore  not  strictly  interpretable  as  probabilities.  For  this 
reason  we  consider  the  perplexities  we  get  on  the  transcript  to  be  approximate  only.  We  still  report  these 
values  in  this  section  because  we  believe  that  the  distortion  is  not  too  severe,  and  the  approximation  still 
provides  useful  insight  into  the  true  perplexity  of  web-improved  language  models.  Note  that,  although  the 
same  kind  of  bias  exists  in  WER  computation,  it  doesn’t  diminish  the  validity  of  the  WER  improvement  we 
get  there,  since  in  classification  it  is  not  the  particular  probability  value  but  the  ranking  that  matters. 
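
For reference, perplexity over N predicted words is exp(-(1/N) Σ log p_i). The sketch below is a generic illustration of that standard definition, with a hypothetical function name. It makes no attempt to reproduce the models in this paper; when the per-word scores do not come from a properly normalized distribution, as discussed above, the resulting number is only an approximation.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over N predicted words."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

# Sanity check: a model that gives every one of the 6195 transcript words
# probability 1/196.7 has perplexity 196.7, the baseline value reported above.
uniform = perplexity([1 / 196.7] * 6195)

# Small worked case: probabilities 1/2, 1/4, 1/8 give perplexity 4.
small = perplexity([0.5, 0.25, 0.125])
```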

Figure 5(a-c) compares different interpolation methods when the reliability threshold τ = 0. There are
2274 unique unreliable trigrams in the transcript. We submitted them (and the corresponding bigrams) as
queries to the search engines, and computed p* with each of the three interpolation methods described in
the previous sections. From p* we computed the approximate perplexities.

Figure 5(a) shows the approximate perplexity with the exponential model and a Gaussian prior. Like the
WER in Figure 4(a), the approximate perplexity converges to the baseline when the Gaussian prior σ² → 0.
The approximate perplexity worsens when σ² → +∞. The best value, 156.9, is achieved by FAST, again
between these two extremes, at σ² = 1. Again, different search engines are similar.

Figure 5(b) shows the approximate perplexity with linear interpolation. It is also similar to the WER in
Figure 4(b). The minimum, 156.2, is reached by FAST at α = 0.15.

Figure 5(c) shows the approximate perplexity with geometric interpolation and ε = 0.01. As with the
previous interpolation methods, the approximate perplexity converges to the baseline when β → 0 and is
worse when β → 1. But unlike the other methods, approximate perplexity seems to be always worse than
the baseline, and increases monotonically with β.
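
To make the roles of α, β and ε concrete, the linear and geometric schemes can be sketched over a toy distribution. The exact formulas are the ones defined earlier in the paper; the forms below are assumptions chosen only so that α = 0 and β = 0 recover the baseline, and the function names and probabilities are hypothetical. The sketch also ignores the restriction of interpolation to unreliable trigrams.

```python
def linear_interpolation(p0, p_web, alpha):
    """Assumed form: (1 - alpha) * p0 + alpha * p_web, word by word."""
    return {w: (1 - alpha) * p0[w] + alpha * p_web.get(w, 0.0) for w in p0}

def geometric_interpolation(p0, p_web, beta, eps=0.01):
    """Assumed form: (p_web + eps)^beta * p0^(1 - beta), then renormalized.

    eps keeps a zero web estimate from forcing the product to zero.
    p0 is assumed to be smoothed, so every p0[w] is positive.
    """
    raw = {w: (p_web.get(w, 0.0) + eps) ** beta * p0[w] ** (1 - beta)
           for w in p0}
    z = sum(raw.values())  # normalizer so the result sums to one
    return {w: v / z for w, v in raw.items()}

# Toy next-word distributions for one history, made up for illustration.
p0 = {"home": 0.2, "hole": 0.5, "whole": 0.3}
p_web = {"home": 0.9, "hole": 0.0, "whole": 0.1}

lin = linear_interpolation(p0, p_web, alpha=0.4)
geo = geometric_interpolation(p0, p_web, beta=0.3)
geo_baseline = geometric_interpolation(p0, p_web, beta=0.0)
```

Under these assumed forms the geometric product must be renormalized explicitly, whereas the linear mixture of two normalized distributions already sums to one.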

Figure 5(d) shows the effect of the reliability threshold τ on the approximate perplexity. As in
Figure 4(d), the interpolation method used is the exponential model with Gaussian prior and σ² = 1. Again we
see improvement when we increase τ from 0 to 1. For example, FAST's approximate perplexity goes down
to 147.5. We believe this can be explained similarly to Figure 4(d).


5  Discussions 

In  this  paper,  we  demonstrated  that  trigram  estimates  obtained  from  the  web  can  significantly  improve  WER 
relative  to  pure  corpus-based  estimates,  even  though  the  web  estimates  are  noisy,  and  the  web  and  the  test 
set  are  not  in  the  same  domain.  We  believe  the  improvement  largely  comes  from  better  trigram  coverage  due 
to  the  sheer  size  of  the  web,  which  acts  as  a  ‘general  English’  knowledge  source.  Interestingly,  which  search 
engine  is  used  doesn’t  make  much  difference.  Furthermore,  which  interpolation  method  is  used  doesn't 
make  much  difference  either  (at  least  for  WER),  as  long  as  an  appropriate  interpolation  parameter  is  chosen. 

Our method has certain advantages. Besides providing better N-gram coverage, the web has content that is
constantly changing, so our method would enable automatic up-to-date language modeling. However, there
are  also  several  disadvantages.  The  most  severe  one  is  the  large  number  of  web  queries.  In  our  experiment, 
we  needed  to  submit  an  average  of  340  queries  to  the  web  for  each  utterance.  This  results  in  heavy  web 
traffic and workload on the search engines, and a very slow rescoring process. Another concern is privacy:
one  may  be  sending  fragments  of  potentially  sensitive  utterances  to  the  web.  Both  problems,  however,  can 
be  partly  solved  by  using  a  web-in-a-box  setting,  i.e.  if  we  have  a  snapshot  of  the  text  content  of  the  whole 
WWW on local storage. Yet another problem is the lack of focus on domain specific language. This might
be solved by querying specific domain hosts instead of the whole web, although by doing so the N-gram
coverage may deteriorate.

Figure 5: Approximate perplexity of web-improved language models as a function of the smoothing parameter
for several different interpolation schemes.

The  method  proposed  in  this  paper  is  only  one  crude  way  of  exploiting  the  web  as  a  knowledge  source 
for  language  modeling.  Instead  of  focusing  on  trigrams,  one  could  look  for  more  complex  phenomena,  e.g. 
semantic  coherence  [16]  among  the  content  words  in  a  hypothesis.  Intuitively,  if  a  hypothesis  has  content 
words  that  ‘go  with  each  other’,  it  is  more  likely  than  one  whose  content  words  seldom  appear  together  in 
a large training text set. The web + search engine approach seems well suited for this purpose. We are
currently pursuing this direction.